Warcbase: Using Scalable Web Analytics to Analyze Canadian Collections En Masse


Ian Milligan 

Assistant Professor 
@ianmilligan1

Nick Ruest 

Digital Assets Librarian 

@ruebot 


UNIVERSITY OF WATERLOO 

FACULTY OF ARTS 

Department of History 


YORK UNIVERSITY




The Web as a Primary 

Source 


Web archives will fundamentally affect the way 
historians write history 

• We will have easier access to information on a 
previously-unknown scale, as well as improved 
capability to parse it; 

• This information will be left by people who rarely 
before left historical sources; 



We live and experience our 
lives online, and historians 

need to study this. 



And it’s all 
happening on a 

scale 
historians 
have difficulty 
imagining! 

Historians and other
humanists can’t do it 

alone 



We need 
collaboration! 



Web Archives for Historical 

Research 


Historians 


Computer Scientists 





Warcbase 


Jimmy Lin (main developer, 
CS/lead), Ian Milligan 
(co-lead, history), Jeremy 
Wiebe (history/PhD), Alice 
Zhou (computer science, 
undergrad), Youngbin Kim 
(computer science, 
undergrad), Nick Ruest 
(librarian @ York) 

Currently using it on the 
GeoCities and Canadian 
Politics web archives, as 
well as WALK (61 Archive-It
collections, 6 institutions) 


Warcbase is an open-source platform for managing web archi…
The platform provides a flexible data model for storing and mi…
metadata and extracted knowledge. Tight integration with Hac…
analytics and data processing via Spark. For more information…
behind it, visit our about page.

Our documentation can be accessed by using the drop-down r…

Getting Started

You can download Warcbase here. The easiest way would be ti…
tutorial. For a conceptual and practical introduction to the conr…
and James Baker's "Introduction to the Bash Command Line" a…

Using Warcbase 


docs.warcbase.org 



Extracting Domain Level Plain Text

• All plain text
• Plain text by domain
• Plain text by URL pattern
• Plain text minus boilerplate
• Plain text filtered by date
• Plain text filtered by language
• Plain text filtered by keyword


All plain text 

This script extracts the crawl date, domain, URL, and plain text from HTML files in the sample 
ARC data (and saves the output to out/). 

import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader}

RecordLoader.loadArchives("src/test/resources/arc/example.arc.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString)))
  .saveAsTextFile("out/")

If you wanted to use it on your own collection, you would change
"src/test/resources/arc/example.arc.gz" to the directory with your own ARC or WARC files, and
change "out/" on the last line to where you want to save your output data.


Note that this will create a new directory to store the output, which cannot already exist. 
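
Each line of the saved output is a tuple printed as text: crawl date, domain, URL, plain text. As a rough illustration of reading those lines back for further analysis, here is a minimal Python sketch. It is not part of Warcbase. The `Record` type and `parse_line` helper are hypothetical names, and the parser assumes that the first three fields contain no commas and that each record sits on a single line, which may not hold for every URL or page.

```python
from typing import NamedTuple, Optional


class Record(NamedTuple):
    """One line of plain-text output (hypothetical helper type)."""
    crawl_date: str
    domain: str
    url: str
    text: str


def parse_line(line: str) -> Optional[Record]:
    """Parse a line shaped like "(date,domain,url,text)".

    Assumption: date, domain and URL contain no commas, so only the
    first three commas separate fields; the rest belongs to the text.
    """
    line = line.rstrip("\n")
    if not (line.startswith("(") and line.endswith(")")):
        return None
    parts = line[1:-1].split(",", 3)
    if len(parts) != 4:
        return None
    return Record(*(part.strip() for part in parts))


if __name__ == "__main__":
    sample = "(20060222,liberal.ca,http://liberal.ca/news_e.aspx?id=11470,Liberal.ca HOME, THE TEAM)"
    record = parse_line(sample)
    print(record.crawl_date, record.domain, record.url)
```

Lines that do not match the expected shape come back as `None`, so a caller can count or log them instead of failing partway through a large derivative file.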


If you want to run it in your Spark Notebook, the following script will show in-notebook plain 
text: 

val r = RecordLoader.loadArchives("/path/to/warcs", sc)
  .keepValidPages()
  .map(r => {





Installing Warcbase 

Install Spark, clone Warcbase, build with Maven

Launch Spark Shell

Run simple scripts (begin by copying and editing our existing ones to get a feel for the interface)

In the future, hoping to 
move to PySpark 


2017-02-20 19:49:53,348 [main] INFO Utils - Successfully started service 'HTTP class server' on port 55947.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_111)
Type in expressions to have them evaluated.
Type :help for more information.
2017-02-20 19:49:58,659 [main] INFO SparkContext - Running Spark version 1.5.1
2017-02-20 19:49:58,691 [main] INFO SecurityManager - Changing view acls to: i2millig
2017-02-20 19:49:58,692 [main] INFO SecurityManager - Changing modify acls to: i2millig
2017-02-20 19:49:58,692 [main] INFO SecurityManager - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(i2millig); users with modify permissions: Set(i2millig)
2017-02-20 19:49:59,261 [sparkDriver-akka.actor.default-dispatcher-2] INFO Slf4jLogger - Slf4jLogger started
2017-02-20 19:49:59,394 [sparkDriver-akka.actor.default-dispatcher-2] INFO Remoting - Starting remoting
2017-02-20 19:49:59,899 [sparkDriver-akka.actor.default-dispatcher-3] INFO Remoting - Remoting started; listening on addresses :[akka.tcp://sparkDriver@130.63.180.18:57930]
2017-02-20 19:49:59,914 [main] INFO Utils - Successfully started service 'sparkDriver' on port 57930.
2017-02-20 19:49:59,936 [main] INFO SparkEnv - Registering MapOutputTracker
2017-02-20 19:49:59,972 [main] INFO SparkEnv - Registering BlockManagerMaster
2017-02-20 19:49:59,998 [main] INFO DiskBlockManager - Created local directory at /tmp/blockmgr-76daf3c7-f9cc-4f0e-8d53-8c564f285e13





Warcbase workshop vm

Digital pedagogy FTW

https://github.com/web-archive-group/warcbase_workshop_vagrant

Introduction

This is a virtual machine for Warcbase workshops. Warcbase documentation can be found here.

The virtual machine that is built uses 2GB of RAM. Your host machine will need to be able to support that.

It requires a lot of data. If you are attending a workshop at a conference, we strongly recommend downloading
everything beforehand.

Coursework can be found in the coursework subdirectory.

Requirements

Download each of the following dependencies.

1. VirtualBox

2. Vagrant

3. Git

Virtual Machine

To install this virtual machine, you have two options.

You can download it from this link and "import the appliance" using VirtualBox. Note that this is a 6.4GB download. If
you do this, skip to "Spark Notebook" below.

Or you can use vagrant to build it yourself, or provision it using aws.

Use

You'll need to get your virtual machine running on the command line. For a basic walkthrough of how to use the
command line, please consult this lesson at the Programming Historian.

From a working directory, please run the following commands.

1. git clone https://github.com/web-archive-group/warcbase_workshop_vagrant.git (this clones this repository)

2. cd warcbase_workshop_vagrant (this changes into the repository directory)

3. vagrant up (this builds the virtual machine - it will take a while and download a lot of data)


Extract all Text 

(20060222,liberal.ca,http://liberal.ca/bio_e.aspx?&id=35049,Liberal Party of Canada HOME THE TEAM THE PARTY MEDIA CENTRE COMMISSIONS YOUR RIDING Omar Alghabra www.omaralghabra.com Home > Mississauga--Erindale Riding Map (PDF) Omar Alghabra came to Canada at a very young age, and immediately knew Canada was his home. He is was first elected 2006 as the Member of Parliament for Mississauga-Erindale. Mr. Alghabra is an experienced entrepreneur. For the past six years, he has worked for a large multinational corporation, carrying out different responsibilities including quality assurance, project management, sales, contract management and management of a complete department handling a global mandate. Mr. Alghabra is an active member of his community. He is the former National President of the Canadian Arab Federation (2004-2005) and a former member of the Community Editorial Board for the Toronto Star. (2003-2004). Mr. Alghabra is currently a member of the Diversity Council for General Electric Canada and is active in Junior Achievement for the Toronto Region. He was a member of the Multicultural Inter-Agency of Peel from 2001 to 2002. Mr. Alghabra has a degree in Mechanical Engineering from Ryerson University and a Masters in Business Administration (MBA) from York University. Omar Alghabra 790 Burnamthorpe West, Unit 10 905-276-2806 info@omaralghabra.ca Riding President Elias Hazineh Send an email Home | News | Your Riding | Contact Us | francais This website is the property of the Liberal Party of Canada and may not be reproduced in whole or in part without express written permission. © Liberal Party of Canada 2006. All rights reserved. Authorized by the registered agent for the Liberal Party of Canada. Privacy Policy)

(20060222,liberal.ca,https://liberal.ca/news_e.aspx?id=11470,Liberal.ca HOME THE TEAM THE PARTY MEDIA CENTRE COMMISSIONS YOUR RIDING Celebrating our National Flag February 15, 2006 February 15 is National Flag Day in Canada, which marks the 41st anniversary of the first raising of the maple leaf flag on Parliament Hill. Today is a celebration of our shared values, common citizenship and sense of pride in this great country we call home. The Canadian flag is one of the most recognizable symbols in the world and flies proudly to remind us all of who we are and where we come from. The maple leaf's symbolic origins date back to the beginning of our nation's history, while the red and white bars on the flag represent strength and unity. Canada's flag was adopted in 1964 under the courageous leadership of Liberal Prime Minister Lester B. Pearson. The idea of changing the Red Ensign which featured the Britain's Union Jack, was very controversial at the time, with the Conservative Party strongly opposed to changing the status quo. Facing strong Conservative resistance in the House of Commons, Pearson's minority government fought hard in the name of national unity and Canada's multicultural future to make the new flag a reality. In an impassioned speech to the House of Commons, Pearson said: "Mr. Speaker, it is for this generation, for this Parliament, to give them and to give us all a common flag; a Canadian flag which, while bringing together but rising above the landmarks and milestones of the past, will say proudly to the world and to the future: I stand for Canada." Thanks to Pearson's courageous leadership, Canadians across this great nation celebrate our flag and what it stands for - a country and a citizenship that are the envy of the world. Home | News | Your Riding | Contact Us | francais This website is the property of the Liberal Party of Canada and may not be reproduced in whole or

greenparty-2005-all-text-boilerpiped

Margaret Mead Admin Username Farewell to Judith 
Green Party members and supporters 
mourn the loss of a dear friend and devoted 
colleague. Judith Hamel passed away Sunday. 

August 21 2005 at Moncton City Hospital after battling 
cancer for more than a year. 

Judith brought new leadership to the Green Party of 
Canada. As an accomplished author, 
entrepreneur and community organizer, she proved 
herself a serious political contender 
when she earned the highest number of votes among 
Green Party candidates in New Brunswick 
Andrea Prazmowski. The Ottawa Citizen, Aug 20, 

2005 Being Green isn't a decision that 

people make once every four years It's a way of life 

that many people embrace. David 

This corpus has 1 document with 6,746,516 total words and 21,959 unique word
forms. Created about 2 minutes ago

Most frequent words in the corpus: party (299143); green (298576); canada (29053t);
acetH (278853); thief (278754)
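
The summary above (total words, unique word forms, most frequent words) comes from a text-analysis tool run on the extracted plain text. As a rough illustration of what such a summary involves, here is a minimal Python sketch. It is not that tool's implementation: the tokenization rule and the names `tokenize` and `corpus_summary` are assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens.

    Assumption: a word is a run of letters, digits or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


def corpus_summary(text, top_n=5):
    """Return total words, unique word forms and the top_n words."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    return {
        "total_words": len(tokens),
        "unique_word_forms": len(counts),
        "most_frequent": counts.most_common(top_n),
    }


if __name__ == "__main__":
    sample = "Green Party members and supporters of the Green Party of Canada"
    print(corpus_summary(sample, top_n=3))
```

On a derivative file of several million words the same counting works line by line, updating one `Counter` rather than holding the whole text in memory.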

Extract Entities


200606 

Andrew Lewis 

Bill 

Bill Hulet 
Brown 
Bruce Abel 
Bush 

Camille Labchuk 
Chandler 
Cherfi 

Chernushenko 

David 


David Chernsuhenko 
David Chernushenko


David Kay 
Derek Pinto 

Ed Broadbent 

Elizabeth May 
Eric Walton 
Fannon 
Gomery 
Green 
Harper 

Harris 

Jim 

Jim Fannon 

Jim Harris 

Jim Harris Speech 
John 

Julie Baribeau 
Junker 
Kevin Colton 
Labchuk 
Layton 

Leonardo DiCaprio 
Manley 

Mark Brooks 
Mark MacGillivray 

Martin 

Michael Robinson 
Milliken 

Paul Martin 


200607 

Adrianne Carr 
Andrew Lewis 
Bill 

Bill Hulet 
Brown 
Bruce Abel 
Bush 

Camille Labchuk 
Chandler 
Cherfi 

Chernushenko 

David 

David Chernsuhenko 

David Chernushenko 

David Chernushenko E... 

David Kay 
Derek Pinto 
Dietrich 

Ed Broadbent 

Elizabeth May 
Eric Walton 
Fannon 
Gomery 
Green 
Harper 

Harris 

Jim 

Jim Fannon 

Adrianne Carr
Allan Gribbin 
Amelie Gingras 
Andrew Lewis 
Bill 

Bill Hulet 
Brown 
Bruce Abel 
Bush 

Chandler 

Cherfi 

Chernushenko 

Clements Verhoeven 

David 

David Chernushenko 

David Kay 
Derek Pinto 
Dietrich 

Ed Broadbent 
Elizabeth May 
Eric Walton 
Fannon 

Gomery 
Green 


Jim Harris 


Harper 

Harris 

Jim 

Jim Harris 

Jim Harris Speech 


Jim Harris Speech 

John 

Julie Baribeau 
Junker 
Kevin Colton 

Labchuk 

Layton 

John

Junker 
Kevin Colton 
Kootenay-Columbia Jo... 
Labchuk 
Lawrence Redfern 
Layton 

Manley 




200609 
Adrianne Carr 
Amelie Gingras 
Brown 
Bruce Abel 

Bush 

Cameron Wigmore 
Chandler 
Cherfi 

Chernushenko 

Chretien 

David 

David Chernushenko 

David Kay 
Derek Pinto 
Dietrich 
Dion 

Elizabeth 

Elizabeth May 

Elizabeth Peloza 
Eric Walton 
Gomery 
Green 
Harper 

Harris 

Jasper 
Jim 

Jim Harris 

Jim Harris Speech 
John 
Labchuk 
Lougheed 
Mackenzie 
Manley 
Martin 
May 

Mona Elaine Adilman ... 

Paul Martin 

Peter Foster 
Pierre Pettigrew 
Schiller 


Elizabeth May 

200610
Ambrose 
Andrew Lewis 
Bill 

Bridget Doherty 
Bush 
Carol Gudz 

Catharine Johannson 
Chandler 
Cherfi 

Chernushenko 
Daphne Wysham 
David 

David Chernushenko 
David Kay 

Derek 

Derek Pinto 
Dundas 


Elizabeth 

Elizabeth Goes 

Elizabeth May 


Elizabeth May Say 
Eric Walton 
Gagnon 
Gomery 
Green 

Grenon 

Halton 

Harper 

Harris 

Jim 

Jim Harris 

John 

Jude Larkin 
Judith 
Kyle Grice 
Labchuk 
Manley 

Mark MacGillivray 
Martin 
May 

Melanie Ransom 
Michael Grayson 
Michele 

Paul Martin 
Richard Reble 
Sharon Labchuk 


200611
Ambrose 
Andrew Lewis 

Bill 

Bill Clinton 
Bush 
Chandler 
Cherfi 

Chernushenko 
Chris Alders 
Daphne Wysham 
David 

David Chernushenko 

David Cox 

David Kay 

David Suzuki 
Derek 

Derek Pinto 

Dundas 

Edward Burtynsky 

Elizabeth 

Elizabeth May 

Eric Walton 
Garth Turner 
Gomery 
Green 

Halton 

Harper 

Harris 

Jim 

Jim Harris 

Jim Harris Speech 
John 

Julie Baribeau 
Labchuk 
Manley 

Margaret 

Mark MacGillivray 
Martin 
May 
Paul 

Paul Martin 

Ross 

Sharon Labchuk 

Distribution of Locations

In[95]:= int = SemanticInterpretation[#] & /@ processedfreq[[All, 1]]

Out[95]= {[Canada], [Calgary], [Colombia], [Montreal], $Failed, [Ontario, Canada], [Afghanistan],
[Ottawa], [Manitoba, Canada], [British Columbia, Canada], [Toronto], [Nova Scotia, Canada],
[Saskatchewan, Canada], [Quebec, Canada], [Alberta, Canada], [Winnipeg], [Washington],
[Canada], [Nunavut, Canada], [White Horse], [United States], [Petawawa], [Mumbai],
[Lima], [The Americas], [Saskatoon], [Peru], [Edmonton], [Mexico], [Chicoutimi-Jonquiere],
[New Brunswick, Canada], [United States], [Newfoundland and Labrador, Canada], [Qandahar],
$Failed, [Asia-Pacific], [Newfoundland and Labrador, Canada], [Victoria], [United States],
[Quebec City], [India], $Failed, [Atlantic Ocean], [St. Germain], [Durham], [Battle of Waterloo],

Finding Sites of Interest

Domain: twitter.com

PageRank: 1.7401636129266924 

Links in: 36267 

Links out: 3 

Total links: 36270 
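
The per-domain figures above (links in, links out, total links) are the kind of summary that can be derived from a domain-level link list. The following Python sketch shows one way to tally them. It is only an illustration under stated assumptions: the input is taken to be (source domain, target domain, count) triples, the function name `link_counts` is hypothetical, and PageRank is left out.

```python
from collections import Counter


def link_counts(edges):
    """Tally inbound, outbound and total links per domain.

    edges: iterable of (source_domain, target_domain, count) triples
    (an assumed input shape for this sketch).
    """
    links_in, links_out = Counter(), Counter()
    for source, target, count in edges:
        links_out[source] += count
        links_in[target] += count
    domains = set(links_in) | set(links_out)
    return {
        domain: {
            "in": links_in[domain],
            "out": links_out[domain],
            "total": links_in[domain] + links_out[domain],
        }
        for domain in domains
    }


if __name__ == "__main__":
    # Made-up edges, for illustration only.
    edges = [
        ("a.example", "twitter.com", 5),
        ("b.example", "twitter.com", 2),
        ("twitter.com", "a.example", 3),
    ]
    for domain, counts in sorted(link_counts(edges).items()):
        print(domain, counts)
```

A domain with many inbound links and very few outbound ones, like the one shown above, stands out immediately in such a table.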

Exploring collection coverage
and curatorial models

[Figure: overlap of the floods, energy, and university collections; counts shown: 633, 7, 30, 103, 187, 87, 2608]


So we have this 
tool... how have we 

used it? 



Scaling up 


Web Archives for 
Longitudinal 
Knowledge 


Ian Milligan (Co-PI, UW) + 
Nick Ruest (Co-PI, York), 
w/ Geoff Harder, Todd 
Suomela, Sonya Betz, 
Peter Binkley, Geoffrey 
Rockwell, Umar Qasim 
(Alberta), Jefferson Bailey 
(Internet Archive), and 
John Simpson (Compute 
Canada). 


Welcome to the Web Archives for Longitudinal
Knowledge (WALK) Project. Spearheaded by the
University of Waterloo, York University, and the
University of Alberta, we are bringing together
Canadian partners to provide access to their
collections.

Our partners

Our project provides access to the holdings of six partners across Canada. For each collection, click on the
logo below to be brought to collections information, datasets, and search portals. Or dig in above to search
them all!


• University of Alberta Libraries
• Dalhousie University
• Simon Fraser University (SFU)
• University of Toronto Libraries
• University of Victoria Libraries
• The University of Winnipeg

Web Archives for
Longitudinal 
Knowledge 

6 universities (SFU, Toronto, 
Alberta, Victoria, Winnipeg, 
and Dalhousie) 

61 collections 

16 TB of web archival
collections 

Ingesting them all to generate 
warcbase derivatives and 
searchable index 


Big Data is reshaping the historical profession and 
society in ways we are only now beginning to grasp. 
Tremendous new opportunities are opening up for 
social and cultural historians. Large web archives 
contain billions of webpages, from personal 
homepages to professional or academic websites, 
offering the ability to reconstruct large-scale aspects of 
the recent past. Yet the sheer size of these primary 
sources presents significant challenges: if the norm 
until the digital era was to have human information 
vanish, “now expectations have inverted. Everything 
may be recorded and preserved, at least potentially”
(as James Gleick noted in his 2012 The Information: A
History, a Theory, a Flood).


Web Archives for
Longitudinal 
Knowledge 


Mass Ingest Script to take 
each collection and: 

• Extract hyperlinks and
generate Gephi files;

• Extract all URLs and domain
counts;

• Extract plain text.


ubuntu@compute-canada-will-accidentally-delete-this:~/production$ cat template.scala
import org.warcbase.spark.matchbox._
import org.warcbase.spark.rdd.RecordRDD._
import org.warcbase.spark.matchbox.{RemoveHTML, RecordLoader, ExtractBoilerpipeText}
val ${COLLECTION} = RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl))).countItems().saveAsTextFile("/data/derivatives/urls/${COLLECTION}")
RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5).saveAsTextFile("/data/derivatives/links/${COLLECTION}")
val ${COLLECTION}gephi = RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
WriteGDF(${COLLECTION}gephi, "/data/derivatives/gephi/${COLLECTION}.gdf")
RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc).keepValidPages().map(r => (r.getCrawlMonth, r.getDomain, r.getUrl, ExtractBoilerpipeText(r.getContentString))).saveAsTextFile("/data/derivatives/text/${COLLECTION}")
exit

ubuntu@compute-canada-will-accidentally-delete-this:~/production$
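
The template uses a ${COLLECTION} placeholder, which suggests one script is generated per collection. The slides do not show how that substitution is done, so the following Python sketch is only one possible approach. The shortened template text, the function names `render_script` and `render_all`, and the example collection name are all assumptions made for illustration.

```python
from string import Template

# Hypothetical, shortened stand-in for template.scala. Only the
# ${COLLECTION} placeholder convention is taken from the slide.
TEMPLATE = Template(
    'RecordLoader.loadArchives("/data/${COLLECTION}/*.gz", sc)'
    ".keepValidPages()"
    ".map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))"
    ".countItems()"
    '.saveAsTextFile("/data/derivatives/urls/${COLLECTION}")\n'
    "exit\n"
)


def render_script(collection: str) -> str:
    """Return the Scala script text for a single collection."""
    return TEMPLATE.substitute(COLLECTION=collection)


def render_all(collections):
    """Map each collection name to its rendered script."""
    return {name: render_script(name) for name in collections}


if __name__ == "__main__":
    # "example_collection" is a made-up name.
    for name, script in render_all(["example_collection"]).items():
        print(f"// script for {name}")
        print(script)
```

Each rendered script could then be written to a file and passed to the Spark shell. `Template.substitute` raises an error if a placeholder is left unfilled, which helps catch a misspelled variable before a long job starts.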


WebArchives.ca 

(Blacklight) 

2017-02-21 21:46:27 ERROR HTMLAnalyser:134 - Failed to canonicalise host: French Language Summer Camps Level I: org.apache.commons.httpclient.URIException: gnu.inet.encoding.IDNAException: Contains non-LDH characters. french%20language%20summer%20camps%20level%20i
2017-02-21 22:07:45 ERROR HTMLAnalyser:134 - Failed to canonicalise host: josephmoreau.schoolappointments. com: org.apache.commons.httpclient.URIException: gnu.inet.encoding.IDNAException: Contains non-LDH characters. josephmoreau.schoolappointments.%20com
2017-02-21 22:07:52 ERROR HTMLAnalyser:134 - Failed to canonicalise host: josep hmoreau.schoolappointments.com: org.apache.commons.httpclient.URIException: gnu.inet.encoding.IDNAException: Contains non-LDH characters. josep%20hmoreau.schoolappointments.com
2017-02-21 22:21:28 INFO Instrument:155 - Performance statistics
WARCIndexerCommand.main#total(#=0, time=0.00ms, avg=0.00#/ms 0.00ms/#, 0.00%)
WARCIndexerCommand.parseWarcFiles#docdelivery(#=2089123, time=1706007.36ms, avg=1.22#/ms 0.82ms/#, 2.07%)
WARCIndexerCommand.checkSubmission#solradd(#=41782, time=1701567.49ms, avg=0.02#/ms 40.72ms/#, 2.06%)
WARCIndexerCommand.parseWarcFiles#startup(#=1, time=3372.23ms, avg=0.00#/ms 3372.23ms/#, 0.00%)
WARCIndexerCommand.commit#success(#=30, time=2663523.35ms, avg=0.00#/ms 88784.11ms/#, 3.23%)
WARCIndexerCommand.parseWarcFiles#fullarcprocess(#=30, time=1445718780.22ms, avg=0.00#/ms 48190626.01ms/#, 1754.28%)
WARCIndexerCommand.parseWarcFiles#solrdocCreation(#=8651855, time=77583908.73ms, avg=0.11#/ms 8.97ms/#, 94.14%)
WARCIndexer.extract#total(#=1740156, time=74867498.43ms, avg=0.02#/ms 43.02ms/#, 90.85%)
TextAnalyzers#total(#=1740156, time=5724356.71ms, avg=0.30#/ms 3.29ms/#, 6.95%)
PostcodeAnalyzer(#=1701619, time=46774.75ms, avg=36.38#/ms 0.03ms/#, 0.06%)
LanguageAnalyzer#total(#=1701619, time=5334955.20ms, avg=0.32#/ms 3.14ms/#, 6.47%)
LanguageDetector#startup(#=1, time=385.84ms, avg=0.00#/ms 385.84ms/#, 0.00%)
LanguageDetector.detectLanguage#li(#=1701619, time=2859452.22ms, avg=0.60#/ms 1.68ms/#, 3.47%)
LanguageIdentifier.addProfile(#=28, time=29.13ms, avg=0.96#/ms 1.04ms/#, 0.00%)
LanguageIdentifier#matchlanguageprofile(#=1701619, time=2551253.69ms, avg=0.67#/ms 1.50ms/#, 3.10%)
LanguageProfile.distanceInterleaved#total(#=47645332, time=2539920.08ms, avg=18.76#/ms 0.05ms/#, 3.08%)
LanguageProfile.Interleaved.update(#=1701647, time=216333.19ms, avg=7.87#/ms 0.13ms/#, 0.26%)
LanguageProfile.distanceInterleaved#dist(#=47645332, time=2236326.36ms, avg=21.31#/ms 0.05ms/#, 2.71%)
LanguageProfile#profilewriter(#=1701619, time=306254.26ms, avg=5.56#/ms 0.18ms/#, 0.37%)
LanguageDetector.detectLanguage#ld(#=1701562, time=2466748.82ms, avg=0.69#/ms 1.45ms/#, 2.99%)
FuzzyHashAnalyzer(#=1701619, time=337414.28ms, avg=5.04#/ms 0.20ms/#, 0.41%)
WARCIndexer.extract#hashstreamwrap(#=2089123, time=3746222.06ms, avg=0.56#/ms 1.79ms/#, 4.55%)
WARCIndexer.extract#analyzetikainput(#=1740156, time=64287494.76ms, avg=0.03#/ms 36.94ms/#, 78.01%)
WARCPayloadAnalyzers.analyze#total(#=1740156, time=64284418.35ms, avg=0.03#/ms 36.94ms/#, 78.00%)
WARCPayloadAnalyzers.analyze#firstbytes(#=1740156, time=17903.05ms, avg=97.20#/ms 0.01ms/#, 0.02%)
ARCNameAnalyzer.analyze(#=1740156, time=13765.56ms, avg=126.41#/ms 0.01ms/#, 0.02%)
XMLAnalyzer.analyze(#=7975, time=7579.49ms, avg=1.05#/ms 0.95ms/#, 0.01%)
WARCPayloadAnalyzers.analyze#arcname(#=1740156, time=14751.68ms, avg=117.96#/ms 0.01ms/#, 0.02%)
WARCPayloadAnalyzers.analyze#tikasolrextract(#=1740156, time=33048823.86ms, avg=0.05#/ms 18.99ms/#, 40.10%)
TikaExtractor.extract#detect(#=1740156, time=8280514.40ms, avg=0.21#/ms 4.76ms/#, 10.05%)
TikaExtractor.extract#extract(#=1735438, time=616945.84ms, avg=2.81#/ms 0.36ms/#, 0.75%)
TikaExtractor.extract#parse(#=1735438, time=23454988.10ms, avg=0.07#/ms 13.52ms/#, 28.46%)
ImageAnalyzer.analyze#facesanddominant(#=2547, time=1933767.38ms, avg=0.00#/ms 759.23ms/#, 2.35%)
WARCPayloadAnalyzers.analyze#droid(#=1740156, time=10547860.60ms, avg=0.16#/ms 6.06ms/#, 12.80%)
HTMLAnalyzer.analyze#total(#=1683855, time=18072449.77ms, avg=0.09#/ms 10.73ms/#, 21.93%)
HTMLAnalyzer.analyze#parser(#=1683855, time=12592642.58ms, avg=0.13#/ms 7.48ms/#, 15.28%)
HtmlFeatureParser.parse#jsoupparse(#=1683855, time=9542197.47ms, avg=0.18#/ms 5.67ms/#, 11.58%)
HtmlFeatureParser.parse#featureextract(#=1683855, time=2142947.86ms, avg=0.79#/ms 1.27ms/#, 2.60%)
PDFAnalyzer.analyze(#=5926, time=636542.70ms, avg=0.01#/ms 107.42ms/#, 0.77%)
WARCIndexer.extract#archeaders(#=2882905, time=693994.36ms, avg=4.15#/ms 0.24ms/#, 0.84%)
SolrRecord.removeControlCharacters#total(#=235107394, time=1551471.12ms, avg=151.54#/ms 0.01ms/#, 1.88%)
SolrRecord.sanitiseUTF8(#=235107394, time=469633.81ms, avg=500.62#/ms 0.00ms/#, 0.57%)